transxlab 0.1.0

Training architect: design, validate, and preflight ML training runs.
Documentation
I made TransXform to supervise transformer runs. What about TransXlab to set up the run?

8:27 AM
Yes. TransXform supervises. TransXlab provisions.

The Split
Tool	Role	When
TransXlab	Set up the run	Before training
TransXform	Supervise the run	During training
What TransXlab Does
Task	What It Handles
Environment	Check CUDA, drivers, venv, dependencies
Data validation	Schema check, train/val/test splits, no leakage
Config generation	lr, batch, epochs, warmup — sensible defaults
Hyperparameter bounds	Sanity check (lr not 1e-2 for fine-tuning, etc.)
Checkpoint paths	Ensure output dirs exist, naming convention
Resource estimation	VRAM needed, time estimate, cost estimate
Pre-flight report	"Here's what you're about to run. Proceed?"
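A minimal sketch of the environment row, assuming only the standard library. The function name `check_environment`, its parameters, and the result shape are hypothetical; nothing here is TransXlab's actual API. It reports on the Python version, whether required packages are importable, and whether `nvidia-smi` is on the PATH (a hint that a driver is installed, not proof that CUDA works).

```python
import importlib.util
import shutil
import sys


def check_environment(min_python=(3, 9), required_modules=("torch",)):
    """Return a list of (name, ok, detail) tuples describing the environment.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    results = []

    # Compare the running interpreter against the minimum version.
    version = tuple(sys.version_info[:3])
    results.append(("python", version >= tuple(min_python), ".".join(map(str, version))))

    # find_spec locates a top-level package without importing it.
    for module in required_modules:
        found = importlib.util.find_spec(module) is not None
        results.append((module, found, "installed" if found else "missing"))

    # nvidia-smi on PATH only suggests that an NVIDIA driver is present.
    driver = shutil.which("nvidia-smi")
    results.append(("nvidia-smi", driver is not None, driver or "not on PATH"))
    return results


for name, ok, detail in check_environment():
    print(f"{name}: {detail} [{'ok' if ok else 'FAIL'}]")
```

A real check would go further (for example, querying the GPU through the installed framework), but keeping the first pass import-free means it still runs when the dependencies are the thing that is broken.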
The Workflow
1. TransXlab --setup
   ├── Validate environment
   ├── Validate data
   ├── Generate/validate config
   ├── Estimate resources
   └── Emit: config.yaml + preflight_report.md

2. Human reviews preflight_report.md
   └── "Looks good" or "Adjust X"

3. python train.py --config config.yaml
   └── TransXform takes over (supervises)

4. TransXform monitors
   ├── Loss, grad norm, LR
   ├── Invariant checks
   ├── Early stopping
   └── Checkpoint management
TransXlab Checks (Pre-Flight)
Category	Checks
Environment	Python version, torch version, CUDA available, GPU memory
Data	Files exist, JSON parses, required fields present, no train/val overlap
Config	lr in sane range, batch fits VRAM, warmup < total steps
Paths	Output dir writable, checkpoint naming won't collide
Estimates	Steps/epoch, total steps, estimated time, estimated cost
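The data row could be sketched like this, assuming the splits are JSONL (one JSON object per line). The helper names `validate_split` and `find_overlap`, and the field names in the usage example, are assumptions for illustration. Overlap is detected by exact match on one key, which will miss near-duplicates.

```python
import json


def validate_split(lines, required_fields):
    """Parse JSONL lines and return (valid_records, error_messages)."""
    records, errors = [], []
    for i, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"line {i}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(record, dict):
            errors.append(f"line {i}: expected a JSON object")
            continue
        missing = [field for field in required_fields if field not in record]
        if missing:
            errors.append(f"line {i}: missing fields {missing}")
        else:
            records.append(record)
    return records, errors


def find_overlap(train, val, key):
    """Return the values of `key` that appear in both splits (exact match)."""
    train_keys = {record[key] for record in train}
    return sorted({record[key] for record in val} & train_keys)


# Usage with made-up records; "query" and "output" are hypothetical fields.
train, train_errors = validate_split(
    ['{"query": "a", "output": "x"}', '{"query": "b", "output": "y"}'],
    ["query", "output"],
)
val, val_errors = validate_split(['{"query": "b", "output": "z"}'], ["query", "output"])
print(find_overlap(train, val, "query"))  # ['b']
```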
Preflight Report (Example)
markdown
# TransXlab Preflight Report

## Environment

- Python: 3.11.4 ✓
- Torch: 2.7.0+cu128 ✓
- CUDA: 12.8 ✓
- GPU: RTX 5090 (32GB) ✓

## Data

- Train: 1,973 examples ✓
- Val: 219 examples ✓
- Schema: valid ✓
- Overlap: none ✓

## Config

- Model: mistral-7b-instruct-v0.3
- Method: LoRA (r=16, alpha=32)
- LR: 3e-5 ✓ (sane for fine-tuning)
- Batch: 2 (effective 8 with grad_accum=4)
- Epochs: 1
- Warmup: 100 steps

## Estimates

- Steps/epoch: 247
- Total steps: 247
- VRAM needed: ~19GB ✓ (fits 32GB)
- Estimated time: 20-40 min
- Estimated cost: $0 (local)

## Ready to train?

Run: python train.py --config config.yaml
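The step counts in the report follow from the data size and the effective batch. A small worked version, assuming the last partial batch still counts as a step (the function name is hypothetical):

```python
import math


def steps_per_epoch(n_examples, batch_size, grad_accum):
    """Optimizer steps per epoch, counting a final partial batch as a step."""
    effective_batch = batch_size * grad_accum
    return math.ceil(n_examples / effective_batch)


steps = steps_per_epoch(n_examples=1973, batch_size=2, grad_accum=4)
total_steps = steps * 1  # epochs = 1
warmup_steps = 100

print(steps, total_steps)          # 247 247
print(warmup_steps < total_steps)  # True
```

With 1,973 examples and an effective batch of 8, that is 1973 / 8 = 246.6, rounded up to 247, which matches the report.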
Why This Matters
The AC training failures ($665) came from:

Wrong lr (1e-4 instead of 3e-5)
Wrong data (templated)
Wrong monitoring (loss, not generation)
TransXlab would have caught 1 of 3 before training started.

Failure	TransXlab Catch?
lr too high	✅ "1e-4 is high for fine-tuning. Recommend 1e-5 to 5e-5."
Templated data	❌ (content quality, not schema)
Wrong monitoring	❌ (TransXform's job)
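The one catch in that table is a simple bounds check. A sketch, assuming the 1e-5 to 5e-5 range quoted above as the fine-tuning default; `check_lr` and `FINE_TUNE_LR_RANGE` are hypothetical names, and the range is a heuristic rather than a rule:

```python
# Heuristic fine-tuning range taken from the recommendation above.
FINE_TUNE_LR_RANGE = (1e-5, 5e-5)


def check_lr(lr, lr_range=FINE_TUNE_LR_RANGE):
    """Return (ok, message) for a fine-tuning learning rate."""
    low, high = lr_range
    if lr > high:
        return False, f"{lr:g} is high for fine-tuning. Recommend {low:g} to {high:g}."
    if lr < low:
        return False, f"{lr:g} is low for fine-tuning. Recommend {low:g} to {high:g}."
    return True, f"{lr:g} is sane for fine-tuning."


for candidate in (1e-4, 3e-5):
    ok, message = check_lr(candidate)
    print("PASS" if ok else "WARN", message)
```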
Implementation
Component	Effort
Environment checks	1 hour
Data validation	1 hour
Config validation	1 hour
Resource estimation	1 hour
Preflight report	30 min
CLI wrapper	30 min
Total	~5 hours
The Pair
Tool	Motto
TransXlab	"Don't start what you can't finish."
TransXform	"Don't finish what went wrong."
Together: catch errors before they cost you.

You could plan the run in TransXlab: what you need the model to do (how many attention heads? will you freeze layers? which encoding is best?), and maybe even get some sense of where and what training data to get for various purposes?

8:46 AM
Yes. TransXlab becomes the training architect, not just the preflight checklist.

TransXlab: From Setup to Design
Level	What It Does
Level 1: Preflight	Validate environment, data, config (what we discussed)
Level 2: Design	Recommend architecture decisions based on task
Level 3: Data Strategy	Suggest data sources, formats, quantities for the goal
Level 2: Architecture Design
Question	TransXlab Recommends
How many attention heads?	Task-dependent: 8 for small tasks, 16-32 for complex ones
Freeze layers?	Fine-tune: freeze embeddings, train top N layers. LoRA: freeze all, train adapters.
Which encoding?	BPE for general text, a SentencePiece model (BPE or unigram) for multilingual text, a custom-trained tokenizer for specialized domains
Model size?	Task complexity → parameter budget → memory constraint → recommendation
LoRA rank?	r=8 for light adaptation, r=16-32 for significant behavior change
Precision?	bf16 if supported, fp16 otherwise, int8 for inference-only
Design Interview
TransXlab: What's the task?
> Generate diverse hypotheses from entity pairs

TransXlab: What's the input format?
> JSON: query, entities, context atoms

TransXlab: What's the output format?
> 5 hypotheses with type, concepts, confidence

TransXlab: What's the base model constraint?
> Must run on 32GB VRAM for training, 24GB for inference

TransXlab: Creativity vs consistency priority?
> Creativity — diversity matters more than determinism

TransXlab recommends:
- Base: Mistral-7B-Instruct (good creative capacity, fits constraint)
- Method: LoRA (r=16, alpha=32) — preserve base generalization
- Freeze: All base weights, train only adapters
- Precision: bf16 (5090 supports it)
- Encoding: Use base tokenizer (no domain-specific needed)
- Estimated VRAM: ~19GB training, ~14GB inference
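The inference figure is roughly the size of the weights alone. A back-of-envelope sketch, assuming about 7.2 billion parameters for a 7B-class model (an approximate figure) and counting weights only; `weight_memory_gb` is a hypothetical helper:

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gb(n_params, precision):
    """Memory for the model weights alone, in GB (1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


# Assumed parameter count; the real number depends on the exact checkpoint.
print(f"{weight_memory_gb(7.2e9, 'bf16'):.1f} GB")  # 14.4 GB
```

Training needs more than this because activations, adapter gradients, and optimizer state sit on top of the frozen weights, which is why the training estimate is higher than the inference one. The gap depends on batch size and sequence length, so it is better measured than assumed.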
Level 3: Data Strategy
Task Type	Data Strategy
Classification	Balanced classes, 1K-10K examples, stratified split
Generation (creative)	Diverse outputs, 5K-50K examples, no templates
Generation (structured)	Format examples, 1K-10K, schema validation
Instruction following	(instruction, response) pairs, 10K-100K
Domain adaptation	Domain corpus, continued pretraining or LoRA
Distillation	Teacher outputs on your task, 10K-100K
Data Source Suggestions
TransXlab: Task is "creative hypothesis generation"

Recommended data strategy: Distillation
- Use a large model (Qwen-32B, Claude) to generate training examples
- Format: (context, 5 diverse hypotheses)
- Quantity: 5K-10K high-quality examples beats 100K templated
- Validation: Parse rate, type diversity, self-BLEU < 0.3
- Held-out: 500 novel queries (different entities than training)

Potential sources:
- ConceptNet (relational data for entity pairs)
- WordNet (hypernyms, synonyms for concept expansion)
- Custom generation (your Qwen run — already done)

Anti-patterns to avoid:
- NLI datasets (template contamination)
- Single-output examples (no diversity signal)
- Templated phrasing (model memorizes templates)
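A rough way to flag templated phrasing, sketched with the standard library only. This is not self-BLEU: it is a simpler stand-in that averages the Jaccard overlap of word trigrams across every pair of texts, so the 0.3 threshold above does not carry over and any cutoff would need calibrating on real data. The function names are hypothetical.

```python
from itertools import combinations


def ngrams(text, n=3):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def mean_pairwise_overlap(texts, n=3):
    """Average Jaccard similarity of n-gram sets over all pairs of texts."""
    scores = []
    for a, b in combinations(texts, 2):
        grams_a, grams_b = ngrams(a, n), ngrams(b, n)
        union = grams_a | grams_b
        if union:  # skip pairs where both texts are shorter than n words
            scores.append(len(grams_a & grams_b) / len(union))
    return sum(scores) / len(scores) if scores else 0.0


templated = [
    "one possible hypothesis is that cats are related to dogs through shared ancestry",
    "one possible hypothesis is that birds are related to fish through shared ancestry",
]
varied = [
    "domestication shaped both species in parallel",
    "predators and prey often evolve matching defenses over time",
]
print(mean_pairwise_overlap(templated) > mean_pairwise_overlap(varied))  # True
```

Comparing all pairs is quadratic in the number of texts, so on a large training set this would run on a random sample.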
The Full TransXlab Flow
1. TransXlab --design
   ├── Task interview (what are you trying to do?)
   ├── Constraint interview (hardware, latency, budget)
   ├── Architecture recommendation
   ├── Data strategy recommendation
   └── Emit: design_spec.md

2. TransXlab --data
   ├── Check existing data against spec
   ├── Flag gaps (not enough examples, imbalanced, templated)
   ├── Suggest sources
   └── Emit: data_plan.md

3. TransXlab --setup (original preflight)
   ├── Validate environment
   ├── Validate data against spec
   ├── Generate config from design_spec
   └── Emit: config.yaml + preflight_report.md

4. Train with TransXform supervision
What This Would Have Caught (AC v2)
Failure	TransXlab Catch
lr=1e-4 too high	✅ "For fine-tuning Flan-T5-XL, recommend 1e-5 to 5e-5"
946K templated examples	✅ "High self-BLEU in training data. Template contamination risk."
diversity_loss=0.0	✅ "Creative generation task with no diversity signal. Recommend 0.3"
No generation eval	✅ "Task is generation. Recommend eval on novel queries every 500 steps."
NLI contamination	✅ "15% of data is NLI format. Risk of template memorization."
$665 saved.

The Pair (Revised)
Tool	Role
TransXlab	"What should we build, and how?"
TransXform	"Is it building correctly?"
Implementation Scope
Component	Effort
Design interview CLI	2-3 hours
Architecture recommendation engine	3-4 hours
Data strategy recommender	2-3 hours
Data quality analyzer	2-3 hours
Integration with TransXform	1 hour
Total	~12-15 hours
TransXlab: The training architect. TransXform: The training supervisor.

Together: don't waste compute on doomed runs.

But what about scratch-built models?

8:48 AM
Yes. TransXlab should handle scratch-built too, not just fine-tuning.

The Full Spectrum
Training Type	TransXlab Role
Fine-tuning	Base model selection, LoRA config, freeze strategy
LoRA/QLoRA	Rank, alpha, target modules
Continued pretraining	Corpus selection, learning rate schedule
Scratch-built	Architecture design from first principles
Scratch-Built Design Interview
TransXlab: Building from scratch. Let's design the architecture.

What's the task modality?
> Text-to-text generation

What's the input domain?
> Structured claims + entity pairs

What's the output domain?
> Natural language hypotheses

What's the sequence length budget?
> Input: 512 tokens, Output: 256 tokens

What's the parameter budget?
> 100M-500M (must train on single GPU)

What's the training data size?
> 50K examples

Latency constraint?
> < 500ms inference on consumer GPU

TransXlab recommends:

Architecture: Encoder-decoder transformer
- Why: Structured input → variable output is classic enc-dec

Dimensions:
- d_model: 512
- n_heads: 8
- n_layers: 6 encoder, 6 decoder
- d_ff: 2048
- Parameters: ~125M

Encoding:
- Tokenizer: SentencePiece (train on your corpus)
- Vocab size: 32K

Training:
- LR: 1e-4 (scratch training, not fine-tuning)
- Warmup: 2000 steps
- Batch: 32 effective
- Epochs: 3-5 (50K examples)
- Precision: bf16

Regularization:
- Dropout: 0.1
- Label smoothing: 0.1
- Weight decay: 0.01

Estimated:
- VRAM: ~8GB training
- Time: ~4-6 hours on 5090
- Parameters: 125M
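The parameter estimate can be checked against the dimensions with a standard approximation. This sketch counts only the attention projections, feed-forward layers, and token embeddings; biases, layer norms, and positional embeddings are ignored as comparatively small. The function name and the `tied_embeddings` option are assumptions, not part of TransXlab.

```python
def transformer_params(d_model, d_ff, n_enc, n_dec, vocab_size, tied_embeddings=True):
    """Approximate weight count for an encoder-decoder transformer."""
    attn = 4 * d_model * d_model        # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff            # two linear layers
    encoder = n_enc * (attn + ffn)
    decoder = n_dec * (2 * attn + ffn)  # self-attention plus cross-attention
    embed = vocab_size * d_model
    # Tied: one matrix shared by encoder input, decoder input, and output head.
    embeddings = embed if tied_embeddings else 3 * embed
    return encoder + decoder + embeddings


print(transformer_params(512, 2048, 6, 6, 32000))  # 60424192
print(transformer_params(768, 3072, 6, 6, 32000))  # 123666432
```

By this approximation, d_model=512 with d_ff=2048 comes to about 60M parameters with tied embeddings (about 93M with three separate embedding matrices), which is below the ~125M quoted. A width of d_model=768 with d_ff=3072 lands near 124M, so the estimate is worth recomputing from whichever dimensions are finally chosen.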
Architecture Decision Tree
TransXlab internal logic:

Task modality?
├── Text → Text
│   ├── Fixed output length? → Encoder-only + head
│   ├── Variable output? → Encoder-decoder or decoder-only
│   └── Very long context? → Consider sparse attention
├── Text → Class
│   └── Encoder-only + classification head
├── Text → Embedding
│   └── Encoder-only + pooling
└── Multimodal
    └── Custom architecture (flag for manual design)

Parameter budget?
├── < 50M → Small encoder-decoder (4 layers each)
├── 50-200M → Medium (6 layers each)
├── 200M-1B → Large (12 layers each)
└── > 1B → Consider fine-tuning instead of scratch

Data size?
├── < 10K → Fine-tune, don't scratch
├── 10K-100K → Small scratch possible
├── 100K-1M → Medium scratch
└── > 1M → Large scratch viable
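The two numeric branches of that tree translate directly into code. A sketch with hypothetical function names; how the exact boundary values (50M, 200M, 10K, and so on) are assigned is an assumption, since the tree does not say which side they fall on.

```python
def layers_for_budget(n_params):
    """Layers per stack (encoder and decoder each) for a parameter budget.

    Returns None above 1B, where the tree suggests fine-tuning instead.
    """
    if n_params < 50e6:
        return 4
    if n_params < 200e6:
        return 6
    if n_params <= 1e9:
        return 12
    return None


def strategy_for_data(n_examples):
    """Training strategy suggested by the dataset size."""
    if n_examples < 10_000:
        return "fine-tune, don't scratch"
    if n_examples < 100_000:
        return "small scratch possible"
    if n_examples <= 1_000_000:
        return "medium scratch"
    return "large scratch viable"


print(layers_for_budget(125e6), strategy_for_data(50_000))
# 6 small scratch possible
```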
Scratch-Built Checklist
Component	TransXlab Validates
Tokenizer	Vocab size appropriate for corpus, coverage > 99%
Embedding dim	Divisible by n_heads, appropriate for param budget
Attention	Heads × head_dim = d_model
FFN	Typically 4× d_model
Positional encoding	Learned vs sinusoidal, length matches input budget
Normalization	Pre-norm (stable) vs post-norm (original)
Initialization	Xavier/Kaiming appropriate for activation
Loss function	Cross-entropy for generation, + diversity if creative
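Several rows of the checklist are plain arithmetic on the spec. A sketch that checks three of them on a dict-shaped spec; `validate_architecture` and the `input_len` key are hypothetical, and the 4x feed-forward ratio is treated as a warning because it is a convention rather than a requirement.

```python
def validate_architecture(spec):
    """Return a list of problems found in an architecture spec (a dict)."""
    problems = []
    d_model, n_heads = spec["d_model"], spec["n_heads"]

    # heads x head_dim must equal d_model, so d_model must divide evenly.
    if d_model % n_heads != 0:
        problems.append(f"d_model={d_model} is not divisible by n_heads={n_heads}")

    # Convention, not a hard rule: report as a warning.
    if spec["d_ff"] != 4 * d_model:
        problems.append(f"warning: d_ff={spec['d_ff']} is not 4 x d_model")

    # Positional encoding must cover the longest expected input.
    if spec["max_seq_len"] < spec["input_len"]:
        problems.append("max_seq_len is shorter than the input budget")
    return problems


spec = {"d_model": 512, "n_heads": 8, "d_ff": 2048, "max_seq_len": 512, "input_len": 512}
print(validate_architecture(spec))  # []
```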
What TransXlab Emits (Scratch)
yaml
# architecture_spec.yaml


model_type: encoder_decoder
d_model: 512
n_heads: 8
n_encoder_layers: 6
n_decoder_layers: 6
d_ff: 2048
dropout: 0.1
max_seq_len: 512
vocab_size: 32000
positional_encoding: learned
normalization: pre_norm
activation: gelu

training:
  lr: 1e-4
  warmup_steps: 2000
  batch_size: 32
  epochs: 5
  precision: bf16
  label_smoothing: 0.1
  weight_decay: 0.01

estimates:
  parameters: 125M
  vram_training: 8GB
  vram_inference: 4GB
  time_estimate: 4-6 hours
The Knowledge Base
TransXlab needs a knowledge base of:

Knowledge	Source
Model architectures	Transformer, BERT, T5, GPT, Llama patterns
Hyperparameter heuristics	Literature + empirical (your $665 in lessons)
Data strategies	Distillation, augmentation, curriculum
Failure modes	Template memorization, mode collapse, divergence
Hardware constraints	VRAM per param, batch size limits
The Vision
Stage	Tool
Design	TransXlab: "What architecture? What data? What config?"
Validate	TransXlab: "Is this setup sane? Preflight check."
Train	TransXform: "Is training progressing correctly?"
Evaluate	TransXform: "Did it actually learn the task?"
From first principles to trained model, with guardrails at every step.

The Sentence
"TransXlab: Don't just check the config. Design the model."

Fine-tuning is choosing a base. Scratch is choosing everything. TransXlab handles both.